Statistics for Psychologists 3.

Computer lab 01

Course Number: BTPS2040BA-E
Contact Information: Please contact me through email anytime:
Author

Kálmán Abari

Published

February 16, 2025

Worked Example 1

This example provides a step-by-step guide for analyzing the mlbootcamp5_train.xlsx dataset using Jamovi. The dataset contains health-related variables and aims to explore factors associated with cardiovascular disease. Follow these instructions to understand data preparation, univariate and bivariate descriptive statistics, and multivariate descriptive statistical methods.

Codebook - mlbootcamp5_train.xlsx

Source: Analyzing cardiovascular data

This is a dataset on cardiovascular disease. All of the dataset values were collected at the moment of medical examination. There are 3 types of input features:

  • Objective: factual information;
  • Examination: results of medical examination;
  • Subjective: information given by the patient.
Structure of the Dataset
| Feature | Variable Type | Variable | Value Type |
|---|---|---|---|
| Age | Objective Feature | age | int (days) |
| Height | Objective Feature | height | int (cm) |
| Weight | Objective Feature | weight | float (kg) |
| Gender | Objective Feature | gender | categorical code |
| Systolic blood pressure | Examination Feature | ap_hi | int |
| Diastolic blood pressure | Examination Feature | ap_lo | int |
| Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |
| Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |
| Smoking | Subjective Feature | smoke | binary |
| Alcohol intake | Subjective Feature | alco | binary |
| Physical activity | Subjective Feature | active | binary |
| Presence or absence of cardiovascular disease | Target Variable | cardio | binary |

Step 1. Import the Dataset

  • Open Jamovi.
  • Click on Open → This PC → Browse and load the mlbootcamp5_train.xlsx file.
Figure 1: Import data file

Step 2. Inspect the Dataset

2.1 Dimensions

To find out how many rows and columns (the “dimensions”) your dataset has in Jamovi, look at the bottom of the data grid in the Data tab (Row count), and in the Variable tab (Variables). You will see the total number of rows (observations) and the total number of columns (variables).

Figure 2: Number of rows
Figure 3: Number of columns

Alternatively, you can use the Descriptives module:

  1. Click on Analyses → Exploration → Descriptives.

  2. Move all variables into the “Variables” box.

  3. The right panel will show details about each variable, and at the bottom, Jamovi typically displays the number of valid rows for each variable (and hence the total).

Figure 4: Row count via Descriptive statistics

For the mlbootcamp5_train.xlsx dataset (as provided on Kaggle), you should see:

  • Number of rows: 70,000

  • Number of columns: 13

Hence, the dataset’s dimensions are 70,000×13.

2.2 Variable names and descriptions

It’s important to review variable names and their descriptions so that your dataset is clear, readable, and ready for analysis. In Jamovi, you can check and modify variable names and descriptions by following these steps:

  1. Open the Variables tab: Go to the Variables tab.

  2. Rename and add a description: You can change the variable name (Name) to something more descriptive and add/edit the Description to provide more information about what the variable represents.

Below are the original variable names from the mlbootcamp5_train.xlsx and some suggested renaming for clarity, along with brief descriptions:

| Original Name | Suggested Name | Description |
|---|---|---|
| id | ParticipantID | Unique identifier for each participant |
| age | Age_days | Age of the participant in days |
| gender | Gender | Gender of the participant (1 = female, 2 = male) |
| height | Height_cm | Participant’s height in centimeters |
| weight | Weight_kg | Participant’s weight in kilograms |
| ap_hi | BloodPressure_Systolic | Systolic blood pressure (in mmHg) from the participant’s last check-up |
| ap_lo | BloodPressure_Diastolic | Diastolic blood pressure (in mmHg) from the participant’s last check-up |
| cholesterol | CholesterolLevel | Cholesterol level category (1 = normal, 2 = above normal, 3 = well above normal) |
| gluc | GlucoseLevel | Glucose level category (1 = normal, 2 = above normal, 3 = well above normal) |
| smoke | SmokingStatus | Smoking status (0 = non-smoker, 1 = smoker) |
| alco | AlcoholIntake | Alcohol intake (0 = no, 1 = yes) |
| active | PhysicalActivity | Physical activity (0 = no, 1 = yes) |
| cardio | CardiovascularDisease | Whether the participant has been diagnosed with cardiovascular disease (0 = No, 1 = Yes) |
Table 1: Recommended Variable Names

By renaming and describing your variables, you make the dataset more intuitive and easier to analyze or share with others.

Figure 5: Rename variables and add descriptions

Make sure that your variable descriptions also include the logic behind how each measurement or category was determined (e.g., what units were used, how categories were assigned, etc.) so that anyone reviewing the dataset can easily understand its structure and meaning.

Step 3. Variable types

When loading a dataset into Jamovi, it’s important to verify that each variable is assigned the correct Measure type. This ensures your analyses will treat the data appropriately. You can check and modify the measure type by selecting each variable in the Data tab and adjusting the Measure type in the sidebar (e.g., Continuous, Nominal, Ordinal, or ID). Below is a summary of recommended measure types for the variables in the mlbootcamp5_train.xlsx dataset:

| Name | Recommended Measure Type |
|---|---|
| ParticipantID | ID |
| Age_days | Continuous |
| Gender | Nominal |
| Height_cm | Continuous |
| Weight_kg | Continuous |
| BloodPressure_Systolic | Continuous |
| BloodPressure_Diastolic | Continuous |
| CholesterolLevel | Ordinal |
| GlucoseLevel | Ordinal |
| SmokingStatus | Nominal (binary) |
| AlcoholIntake | Nominal (binary) |
| PhysicalActivity | Nominal (binary) |
| CardiovascularDisease | Nominal (binary) |
Table 2: Recommended Measure Type

Remember to double-check that each variable’s measure type accurately reflects the nature of the data (e.g., use Continuous for numeric measurements with a large range, Ordinal for ranked categories, Nominal for labels/categories, and ID for unique identifiers).

Figure 6: Setting Measure Type

Step 4. Levels of categorical variables

4.1 Rename category levels

Renaming category levels in your variables is essential for clarity and interpretability. In Jamovi, you can rename levels by opening the Data tab, selecting the variable of interest, and editing the “Levels” (or “Labels”) in the sidebar. Properly labeled categories help ensure that others (and your future self!) can clearly understand your dataset and analysis outputs.

Below is an example table summarizing possible level renaming for some of the categorical variables:

| Variable | Original Levels | Suggested Level Names |
|---|---|---|
| Gender | 1, 2 | 1 = Female, 2 = Male |
| SmokingStatus | 0, 1 | 0 = Non-smoker, 1 = Smoker |
| AlcoholIntake | 0, 1 | 0 = Non-drinker, 1 = Drinker |
| PhysicalActivity | 0, 1 | 0 = Inactive, 1 = Active |
| CholesterolLevel | 1, 2, 3 | 1 = Normal, 2 = Above Normal, 3 = Well Above Normal |
| GlucoseLevel | 1, 2, 3 | 1 = Normal, 2 = Above Normal, 3 = Well Above Normal |
| CardiovascularDisease | 0, 1 | 0 = No, 1 = Yes |
Table 3: Renaming levels for categorical variables

To rename levels:

  1. Go to the Data tab in Jamovi.

  2. Select the variable whose levels you want to rename.

  3. Edit the labels under “Levels.”

Figure 7: Rename category levels
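For readers who want to see the same relabelling outside Jamovi, the idea can be sketched in plain Python. This is only an illustration: the three records below are hypothetical, and the dictionaries simply restate the code-to-label mapping from Table 3.

```python
# Code-to-label mappings taken from Table 3.
GENDER_LABELS = {1: "Female", 2: "Male"}
CHOLESTEROL_LABELS = {1: "Normal", 2: "Above Normal", 3: "Well Above Normal"}

# Hypothetical raw records (not rows from the real dataset).
records = [
    {"Gender": 1, "CholesterolLevel": 3},
    {"Gender": 2, "CholesterolLevel": 1},
    {"Gender": 1, "CholesterolLevel": 2},
]

# Replace each numeric code with its descriptive label.
labelled = [
    {
        "Gender": GENDER_LABELS[r["Gender"]],
        "CholesterolLevel": CHOLESTEROL_LABELS[r["CholesterolLevel"]],
    }
    for r in records
]
print(labelled[0])
```

Keeping the mapping in one dictionary per variable mirrors the Levels panel in Jamovi: the stored codes stay the same, only the displayed labels change.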

4.2 Reorder category levels

When dealing with ordinal variables in Jamovi, it’s important to ensure that each level is not only properly named, but also placed in the correct order (i.e., from “lowest” or “least” to “highest” or “most”). This ordering allows Jamovi to correctly interpret the progression or ranking within the variable. Below is an example table summarizing how you might order the levels of the ordinal variables:

| Variable | Original Levels | Ordered Levels |
|---|---|---|
| CholesterolLevel | 1, 2, 3 | 1 = Normal → 2 = Above Normal → 3 = Well Above Normal |
| GlucoseLevel | 1, 2, 3 | 1 = Normal → 2 = Above Normal → 3 = Well Above Normal |
Table 4: Ordering levels for ordinal variables

How to set the order in Jamovi:

  1. Go to the Data tab.

  2. Select the variable (e.g., CholesterolLevel).

  3. Under Measure type, set it to Ordinal.

  4. Adjust the Levels so they appear in ascending or logical order (e.g., Normal, Above Normal, Well Above Normal).

Figure 8: Reorder levels of an ordinal variable

By setting the appropriate order, you ensure that statistical tests will treat these variables as ordinal rather than nominal, preserving the meaningful ranking in your analysis.

Step 5. Recode variables

Sometimes you need to create new variables or modify existing variables to perform your analyses effectively. In Jamovi, you can compute, transform, or recode variables in a few easy steps using the Compute or Transform option. Below are some examples based on the mlbootcamp5_train.xlsx dataset.

Why Compute or Recode Variables?

  • Improve Clarity: Turning numeric codes into descriptive categories makes your analysis more understandable.

  • Enable Proper Analysis: Many statistical models or visualizations require continuous vs. categorical variables or well-defined groups.

  • Focus on Research Questions: Grouping and transforming data can help isolate the variables most relevant to your analysis goals.

By thoughtfully computing, transforming, and recoding variables, you can tailor your dataset to the exact questions you’re asking—leading to clearer, more insightful results.

5.1 Numeric to numeric (from one to one)

If your dataset stores the “age” variable in days (e.g., Age_days), you can convert it to years in Jamovi by creating a new Computed variable:

\[\text{Age\_years} = \frac{\text{Age\_days}}{365}\]

Converting age from days to years makes it more intuitive for analysis (e.g., categorizing individuals by age groups). This is an example of data transformation, where a new numeric variable is created from a single numeric variable in the dataset.

Steps in Jamovi

  1. Go to the Data tab.

  2. Select the variable Age_days.

  3. At the top, click the Compute icon (the calculator symbol).

  4. In the new compute panel, name your new variable (e.g., Age_years).

  5. In the Formula box, type the expression: Age_days/365.

A new variable (e.g., Age_years) will appear in your dataset, reflecting the participant’s age in years instead of days.
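The same conversion can be sketched in Python to check a few values by hand. The function name and the sample ages below are hypothetical; the divisor 365 is the one used in the formula above.

```python
def days_to_years(age_days):
    """Convert an age in days to years using the document's divisor of 365."""
    return age_days / 365

# Hypothetical ages in days (illustrative values only).
age_days = [18393, 20228, 17474]
age_years = [days_to_years(d) for d in age_days]
print([round(a, 1) for a in age_years])
```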

Figure 9: Insert new computed variable

5.2 Numeric to numeric (from many to one)

A common calculation is the Body Mass Index (BMI), which you can compute using:

\[\text{BMI} = \frac{\text{weight in kilograms}}{(\text{height in meters})^2}\]

Since height in the dataset is in centimeters, remember to convert it to meters before calculating BMI.

This is an example of a derived numeric variable, which is a variable that is calculated from many other numeric variables in the dataset.

Steps in Jamovi

  1. Go to the Data tab.

  2. Select the variable Weight_kg.

  3. At the top, click the Compute icon (the calculator symbol).

  4. In the new compute panel, name your new variable (e.g., BMI).

  5. In the Formula box, type the expression: Weight_kg/(Height_cm/100)^2.

Figure 10: Compute BMI
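As a quick cross-check of the Jamovi formula, the BMI calculation can be written as a small Python function. This is a minimal sketch; the function name and the example person (62 kg, 168 cm) are made up for illustration.

```python
def bmi(weight_kg, height_cm):
    """Body Mass Index: weight in kg divided by squared height in meters."""
    height_m = height_cm / 100  # the dataset stores height in centimeters
    return weight_kg / height_m ** 2

# Hypothetical participant: 62 kg, 168 cm.
print(round(bmi(62.0, 168), 1))
```

Dividing Height_cm by 100 inside the function corresponds to the `(Height_cm/100)` part of the Jamovi expression.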

5.3 Numeric to categorical

5.3.1. Combined Indicator

You could also create a combined variable that indicates potential hypertension if both Systolic (BloodPressure_Systolic) and Diastolic (BloodPressure_Diastolic) blood pressure measurements exceed certain thresholds. For instance, you might label a person as High_BP if:

  • Systolic \(\geq\) 140 AND Diastolic \(\geq\) 90

Otherwise, label them as Normal.

Steps in Jamovi

  1. Go to the Data tab.

  2. Select the variable BloodPressure_Diastolic.

  3. At the top, click the Compute icon (the calculator symbol).

  4. In the new compute panel, name your new variable (e.g., Hypertension)

  5. In the Formula box, type the expression: IF(BloodPressure_Systolic >= 140 and BloodPressure_Diastolic >= 90,'High_BP', 'Normal')

Figure 11: Compute a new categorical variable with the `IF()` function
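The logic of the `IF()` expression can be illustrated with an equivalent Python function. This is a sketch under the thresholds stated above (140 and 90 mmHg); the function name is hypothetical.

```python
def hypertension(systolic, diastolic):
    """Return 'High_BP' only when BOTH thresholds are met, else 'Normal'."""
    if systolic >= 140 and diastolic >= 90:
        return "High_BP"
    return "Normal"

# Hypothetical readings (systolic, diastolic).
for reading in [(140, 90), (150, 80), (120, 80)]:
    print(reading, hypertension(*reading))
```

Note that a reading such as 150/80 is labelled Normal here, because the rule requires both conditions to hold at the same time.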

5.3.2 Recoding a Numeric Variable into Categories

If you want to group participants into age brackets (e.g., “Young,” “Middle-aged,” “Older”), you could recode the Age_years variable into categories. For instance, you can recode the following age groups:

  1. 0–35 years as “Young”

  2. 36–55 years as “Middle-aged”

  3. Over 55 as “Older”

Steps in Jamovi

  1. Go to the Data tab.

  2. Select the variable Age_years.

  3. At the top, click the Transform icon.

  4. Choose the Create New Transform… item from the “using transform” list (Figure 12).

  5. Rename the Transform as `Age Group`, then click on Add recode condition button twice.

  6. Fill in the condition boxes as shown in Figure 13.

  7. Fill in the Variable suffix field with the text ..._cat.

  8. Close the Transform and Transformed Variable panes using the arrows.

Figure 12: Create new transform
Figure 13: Add Recode Conditions
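The recode conditions amount to a chain of comparisons, which can be sketched in Python as follows. One assumption is made loudly here: because Age_years is not a whole number, the sketch treats ages up to and including 35 as “Young” and ages above 35 up to and including 55 as “Middle-aged”; the exact boundary handling in your Jamovi transform depends on the conditions you enter.

```python
def age_group(age_years):
    """Recode a numeric age into three brackets (boundary handling is an assumption)."""
    if age_years <= 35:
        return "Young"
    if age_years <= 55:
        return "Middle-aged"
    return "Older"

# Hypothetical ages in years.
for age in [30, 35, 36, 55, 60]:
    print(age, age_group(age))
```

The conditions are checked from top to bottom, so each later condition only applies to cases that did not match an earlier one.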

Step 6. Filter dataset

The mlbootcamp5_train.xlsx dataset contains health and demographic information for 70,000 participants. However, some entries may contain inaccuracies or “dirt” that can skew your analysis. We’ll filter out the following erroneous patient segments:

  1. Diastolic pressure higher than systolic pressure.

  2. Height below the 2.5th percentile or above the 97.5th percentile.

  3. Weight below the 2.5th percentile or above the 97.5th percentile.

After filtering, we’ll calculate the percentage of data removed.

Steps in Jamovi

  1. Calculate Percentiles for Height and Weight

    To filter out extreme values, we’ll determine the 2.5th and 97.5th percentiles for both height and weight.

    1. Navigate to Descriptives:

      • Click on Analyses in the top menu.

      • Select Exploration and then Descriptives.

    2. Calculate Percentiles for Height and Weight:

      • Variables: Drag Height_cm and Weight_kg to the Variables box.

      • Statistics:

        • In the Statistics section, check Percentile Values.

        • In the Percentiles field, enter 2.5 and 97.5 to calculate the 2.5th and 97.5th percentiles.

      • Note the Values: Record the 2.5th and 97.5th percentile values for height and weight from the Results pane (Figure 14).

  2. Compute a new Dummy Variable for erroneous cases

    1. Go to the Data tab.

    2. Select the variable Hypertension.

    3. At the top, click the Compute icon (the calculator symbol).

    4. In the new compute panel, name your new variable (e.g., Erroneous_Data)

    5. In the Formula box, type the expression (Figure 15):
      IF((BloodPressure_Diastolic > BloodPressure_Systolic) or (Height_cm < 150) or (Height_cm > 180) or (Weight_kg < 51) or (Weight_kg > 108),'Erroneous', 'Valid')

  3. Calculate Percentage Removed

    1. Navigate to Descriptives:

      • Click on Analyses in the top menu.

      • Select Exploration and then Descriptives.

    2. Frequency Table for Erroneous Data

      • Variables: Drag Erroneous_Data to the Variables box.

      • Frequencies:

        • Check the Frequency tables checkbox (Figure 16).

        • Almost 10% of the dataset (about 9.6%, leaving 63,259 of the 70,000 rows) will be filtered out.

  4. Filter out erroneous data

    1. Go to the Data tab.

    2. At the top, click the Filters icon.

    3. In the filter panel that opens, locate the Filter 1 box.

    4. In the Filter 1 box, type the expression (Figure 17):
      Erroneous_Data == 'Valid'

Figure 14: Percentiles for Height and Weight
Figure 15: Compute a Dummy Variable for filter data
Figure 16: Frequency table for Erroneous Data
Figure 17: Filter out Erroneous data
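The flag-then-filter workflow of Step 6 can be sketched in Python to make the logic explicit. The cut-off values (150, 180, 51, 108) are the ones used in the Jamovi formula above; the four rows and the helper name `classify` are hypothetical.

```python
def classify(row):
    """Flag a row as 'Erroneous' if any of the Step 6 rules is violated."""
    bad = (
        row["BloodPressure_Diastolic"] > row["BloodPressure_Systolic"]
        or row["Height_cm"] < 150 or row["Height_cm"] > 180
        or row["Weight_kg"] < 51 or row["Weight_kg"] > 108
    )
    return "Erroneous" if bad else "Valid"

# Hypothetical rows (not taken from the real dataset).
rows = [
    {"BloodPressure_Systolic": 120, "BloodPressure_Diastolic": 80, "Height_cm": 165, "Weight_kg": 70},
    {"BloodPressure_Systolic": 80, "BloodPressure_Diastolic": 120, "Height_cm": 170, "Weight_kg": 75},
    {"BloodPressure_Systolic": 130, "BloodPressure_Diastolic": 85, "Height_cm": 145, "Weight_kg": 60},
    {"BloodPressure_Systolic": 140, "BloodPressure_Diastolic": 90, "Height_cm": 175, "Weight_kg": 90},
]

flags = [classify(r) for r in rows]

# Keep only valid rows, like the filter expression Erroneous_Data == 'Valid'.
valid_rows = [r for r, f in zip(rows, flags) if f == "Valid"]

# Percentage of rows removed by the filter.
pct_removed = 100 * flags.count("Erroneous") / len(rows)
print(flags, pct_removed)
```

Computing the flag first and filtering second, as in the Jamovi steps, lets you inspect how many rows would be dropped before anything is actually excluded.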

Step 7. Univariate Descriptive Statistics

Univariate descriptive statistics involve summarizing and describing the main features of a single variable. These statistics provide simple summaries about the sample and the measures, offering insights into the data’s central tendency, variability, distribution shape, and frequency. Mastering these techniques in Jamovi will equip you with the foundational skills necessary for effective data analysis in psychological research.

7.1 Measures of Central Tendency: Mean, Median, and Mode

Objective

Calculate the central tendency measures—Mean, Median, and Mode—for continuous variables such as Height_cm and Weight_kg.

Steps in Jamovi

  1. Navigate to Descriptives:

    • Click on Analyses → Exploration → Descriptives.
  2. Select Variables:

    • Drag Height_cm and Weight_kg into the “Variables” box.
  3. Choose Statistics:

    • In the “Statistics” section, ensure “Mean”, “Median”, and “Mode” are checked.

Interpreting Results

  • Mean: The average value, providing a measure of central tendency.

  • Median: The middle value when data is ordered, less affected by outliers.

  • Mode: The most frequently occurring value.

Example Interpretation

  • Height_cm: Mean = 164.49 cm, Median = 165 cm, Mode = 165 cm.

  • Weight_kg: Mean = 73.53 kg, Median = 72 kg, Mode = 65 kg.

Figure 18: Measures of Central Tendency: Mean, Median, and Mode
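To see how the three measures can differ, here is a small Python sketch using the standard `statistics` module. The six weights are hypothetical and chosen so that, as in the Weight_kg results above, the mean exceeds the median, which exceeds the mode.

```python
import statistics

# Hypothetical weights in kg (illustrative, right-skewed sample).
weights = [65, 65, 70, 72, 80, 95]

mean = statistics.mean(weights)      # arithmetic average
median = statistics.median(weights)  # middle value of the sorted data
mode = statistics.mode(weights)      # most frequent value

print(mean, median, mode)
```

The single large value (95) pulls the mean upward but leaves the median and mode unchanged, which is why the median is described as less affected by outliers.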

7.2 Measures of Variability: Standard Deviation and Variance

Objective

Assess the spread or dispersion of continuous variables using Standard Deviation and Variance.

Steps in Jamovi

  1. Navigate to Descriptives:

    • Click on Analyses → Exploration → Descriptives.
  2. Select Variables:

    • Drag Height_cm and Weight_kg into the “Variables” box.
  3. Choose Statistics:

    • In the “Statistics” section, check “Std. deviation” and “Variance”.

Interpreting Results

  • Standard Deviation (SD): Indicates the typical distance of data points from the mean; it is the square root of the variance.

  • Variance: The square of the standard deviation, representing data dispersion.

Example Interpretation

  • Height_cm: SD = 6.86 cm, Variance = 47.12 cm².

  • Weight_kg: SD = 11.91 kg, Variance = 141.95 kg².

Figure 19: Measures of Variability: Standard Deviation and Variance
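The relationship between the two measures can be verified with a short Python sketch. The data are hypothetical, and the sketch uses the sample (n − 1) versions from the `statistics` module, which is assumed here to correspond to what Jamovi reports.

```python
import statistics

# Hypothetical scores with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

sd = statistics.stdev(data)      # sample standard deviation (n - 1 denominator)
var = statistics.variance(data)  # sample variance (n - 1 denominator)

# The variance is the square of the standard deviation.
print(round(sd, 3), round(var, 3), round(sd ** 2, 3))
```

Because the variance is expressed in squared units (cm², kg²), the standard deviation is usually the easier of the two to interpret.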

7.3 Range and Interquartile Range (IQR)

Objective

Determine the Range and Interquartile Range (IQR) to understand the data’s spread and the middle 50% distribution.

Steps in Jamovi

  1. Navigate to Descriptives:

    • Click on Analyses → Exploration → Descriptives.
  2. Select Variables:

    • Drag Height_cm and Weight_kg into the “Variables” box.
  3. Choose Statistics:

    • In the “Statistics” section, check “Minimum”, “Maximum”, “Range”, and “IQR”.

Interpreting Results

  • Range: Difference between the maximum and minimum values.

  • IQR: Difference between the 75th percentile (Q3) and the 25th percentile (Q1), representing the middle 50% of data.

Example Interpretation

  • Height_cm: Range = 30 cm (150 cm to 180 cm), IQR = 9 cm.

  • Weight_kg: Range = 57 kg (51 kg to 108 kg), IQR = 16 kg.
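The two definitions can be sketched in Python as follows. The nine heights are hypothetical, and the `inclusive` quantile method of the `statistics` module is an assumption; other software (including Jamovi) may use a slightly different interpolation rule, so results on real data can differ in the last digit.

```python
import statistics

# Hypothetical heights in cm.
heights = [150, 158, 160, 163, 165, 167, 169, 172, 180]

# Range: maximum minus minimum.
value_range = max(heights) - min(heights)

# Quartiles: the three cut points that split the data into four groups.
q1, q2, q3 = statistics.quantiles(heights, n=4, method="inclusive")
iqr = q3 - q1

print(value_range, q1, q3, iqr)
```

The range depends only on the two most extreme values, whereas the IQR ignores the lowest and highest quarters of the data and is therefore more robust.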

7.4 Frequency Distributions for Categorical Variables

Objective

Summarize categorical variables by calculating the frequency and percentage of each category, such as Gender, SmokingStatus, and CholesterolLevel.

Steps in Jamovi

  1. Navigate to Descriptives:

    • Click on Analyses → Exploration → Descriptives.
  2. Select Variables:

    • Drag Gender, SmokingStatus, and CholesterolLevel into the “Variables” box.
  3. Choose Statistics:

    • Ensure “Frequency tables” is checked.

    • Optionally, check “Bar plot” under “Plots” for visual representation.

Interpreting Results

  • Frequency Tables: Show the count and percentage of each category within a variable.

  • Bar plot: Visual representation of the frequency distribution.

Example Interpretation:

  • Gender:

    • Female: 41,334 (65%)

    • Male: 21,925 (35%)

    • The dataset comprises 65% female and 35% male participants. This indicates a predominance of female participants in the study.

  • SmokingStatus:

    • Non-smoker: 57,800 (91%)

    • Smoker: 5,459 (9%)

    • A vast majority of participants are non-smokers (91%), while smokers constitute only 9% of the sample.

  • CholesterolLevel:

    • Normal: 47,719 (75%)

    • Above Normal: 8,428 (13%)

    • Well Above Normal: 7,112 (11%)

    • The distribution of cholesterol levels shows that:

      • 75% of participants have normal cholesterol levels.

      • 13% have above normal levels.

      • 11% have well above normal levels.

Figure 20: Frequency Distributions for Categorical Variables
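The percentages in a frequency table are simply each count divided by the total. A minimal Python sketch, using the Gender counts reported above (the function name is hypothetical):

```python
def percentages(counts):
    """Convert a dict of category counts into whole-number percentages."""
    total = sum(counts.values())
    return {category: round(100 * n / total) for category, n in counts.items()}

# Gender counts as reported in the example interpretation.
gender_counts = {"Female": 41334, "Male": 21925}
gender_pct = percentages(gender_counts)
print(sum(gender_counts.values()), gender_pct)
```

The total of 63,259 is the number of rows that remain after the filter from Step 6 has been applied.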

7.4 Percentiles and Quartiles

Objective

Identify specific data points within a distribution by calculating Percentiles and Quartiles for continuous variables.

Steps in Jamovi

  1. Navigate to Descriptives:

    • Click on Analyses → Exploration → Descriptives.
  2. Select Variables:

    • Drag Height_cm and Weight_kg into the “Variables” box.
  3. Choose Statistics:

    In the “Statistics” section, check “Percentile Values”.

    • Check “Cut points for 4 equal groups” for quartiles (25th, 50th, and 75th percentiles).

    • Check “Percentiles” and enter the desired percentiles, e.g., 2.5 and 97.5 (or 25, 50, 75 for quartiles).

Interpreting Results

  • Quartiles:

    • Q1 (25th percentile): 25% of data falls below this value.

    • Q2 (50th percentile/Median): 50% of data falls below this value.

    • Q3 (75th percentile): 75% of data falls below this value.

  • Specific Percentiles: Useful for identifying outliers or specific data points.

Example Interpretation:

  • Height_cm:

    • Q1 = 160 cm
      25% of participants have a height below 160 cm. This value marks the lower quartile of the height distribution.

    • Median = 165 cm
      The median height is 165 cm, meaning that 50% of the participants are shorter than 165 cm, and 50% are taller.

    • Q3 = 169 cm
      75% of participants have a height below 169 cm. This value represents the upper quartile of the height distribution.

  • Weight_kg:

    • 2.5th Percentile = 54 kg
      2.5% of participants weigh less than 54 kg. This value marks the lower end of the weight distribution.

    • 97.5th Percentile = 100 kg
      97.5% of participants weigh less than 100 kg, meaning 2.5% weigh 100 kg or more. This value marks the upper end of the weight distribution.

Figure 21: Percentiles and Quartiles
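Arbitrary percentiles such as the 2.5th and 97.5th can be obtained in Python by splitting the data into 40 equal groups, since each cut point then sits 2.5 percentage points apart. This is a sketch on hypothetical data; the `inclusive` interpolation method is an assumption and may not match Jamovi exactly.

```python
import statistics

# Hypothetical weights in kg: the 41 whole numbers from 50 to 90.
weights = list(range(50, 91))

# 39 cut points at 2.5%, 5%, ..., 97.5%.
cuts = statistics.quantiles(weights, n=40, method="inclusive")
p2_5 = cuts[0]    # 2.5th percentile
p97_5 = cuts[-1]  # 97.5th percentile

print(p2_5, p97_5)
```

Together these two values bracket the central 95% of the observations, which is the range kept by the height and weight rules in Step 6.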

7.5 Skewness and Kurtosis

Objective

Assess the Skewness (asymmetry) and Kurtosis (tailedness) of continuous variables to understand their distribution shapes.

Steps in Jamovi

  1. Navigate to Descriptives:

    • Click on Analyses → Exploration → Descriptives.
  2. Select Variables:

    • Drag Height_cm and Weight_kg into the “Variables” box.
  3. Choose Statistics:

    • In the “Statistics” section, check “Skewness” and “Kurtosis”.

Interpreting Results

  • Skewness:

    • Positive Skew: Tail extends to the right; mean > median.

    • Negative Skew: Tail extends to the left; mean < median.

    • Zero Skew: Symmetrical distribution.

  • Kurtosis:

    • High Kurtosis: Data have heavy tails; more outliers.

    • Low Kurtosis: Data have light tails; fewer outliers.

    • Mesokurtic: Kurtosis similar to that of a normal distribution (a reported value near 0).

Example Interpretation

  • Height_cm

    • Skewness (0.06):

      • Nearly Symmetrical: A skewness value close to 0 indicates that the distribution of Height_cm is nearly symmetrical.

      • Minimal Asymmetry: With a skewness of 0.06, there is minimal right skew (slightly longer tail on the right), but it’s practically negligible.

    • Kurtosis (-0.58):

      • Platykurtic Distribution: A kurtosis value of -0.58 suggests that the distribution of Height_cm is platykurtic, meaning it is flatter than a normal distribution.

      • Light Tails: The data has lighter tails, indicating fewer outliers compared to a normal distribution.

  • Weight_kg

    • Skewness (0.55):

      • Moderate Right Skew: A skewness value of 0.55 indicates a moderate positive skew, meaning the distribution of Weight_kg has a longer tail on the right side.

      • Mass Concentration: More participants have weights below the mean, with fewer individuals having significantly higher weights.

    • Kurtosis (-0.19):

      • Slightly Platykurtic: A kurtosis of -0.19 suggests that Weight_kg is slightly platykurtic, i.e., a bit flatter than a normal distribution.

      • Light Tails: Similar to Height_cm, there are fewer extreme outliers.

Figure 22: Skewness and Kurtosis
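To make the two shape measures concrete, here is a minimal Python sketch based on the simple moment formulas. This is an assumption-laden illustration: Jamovi applies sample-size corrections, so its values will differ somewhat from these, especially in small samples. The function names and data are hypothetical.

```python
import statistics

def central_moment(values, k):
    """k-th central moment: average of (x - mean) ** k."""
    m = statistics.fmean(values)
    return sum((x - m) ** k for x in values) / len(values)

def skewness(values):
    """Moment-based skewness: 0 for a perfectly symmetrical sample."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores about 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]      # hypothetical symmetrical data
right_skewed = [1, 1, 2, 2, 10]  # hypothetical data with a long right tail

print(skewness(symmetric), skewness(right_skewed), excess_kurtosis(symmetric))
```

Subtracting 3 is what makes negative values indicate a flatter-than-normal (platykurtic) shape, as in the Height_cm and Weight_kg examples above.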

7.6 Visualizing Data: Histograms and Boxplots

Objective

Create visual representations of data distributions to complement numerical descriptive statistics.

Steps in Jamovi

  1. Navigate to Descriptives:

    • Click on Analyses → Exploration → Descriptives.
  2. Select Variables:

    • Drag Height_cm and Weight_kg into the “Variables” box.
  3. Choose Plots:

    • Under the “Plots” section, check “Histogram” and “Box plot”.

Interpreting Results

  • Histogram:

    • Displays the frequency distribution of the selected variable.

    • Shape: Observe the distribution’s skewness, modality, and presence of outliers.

  • Boxplot:

    • Visualizes the median, quartiles, and potential outliers.

    • Insights: Identify data symmetry, skewness, and extreme values.

Example Interpretation

  • Height_cm - Histogram. The histogram provides a visual representation of the distribution of height (Height_cm) in the dataset. Here’s how to interpret this histogram step by step:

    • Shape of the Distribution

      • The histogram appears approximately symmetrical, with the highest bar (peak) around 170 cm.

      • The shape resembles a bell curve, indicating that the data may follow a normal distribution, though there are slight variations in bar heights.

    • Central Tendency

      • The peak (mode) of the distribution occurs near 170 cm, suggesting that this is the most common height range.

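      • Since the histogram is approximately symmetrical, the mean and median should lie close to the peak; the values from Section 7.1 (mean = 164.49 cm, median = 165 cm) place them slightly below 170 cm.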

    • Spread and Range

      • The heights range roughly from 150 cm to 180 cm. Most of the data falls between 160 cm and 170 cm, with fewer individuals at the extremes (150 cm or 180 cm).
    • Density and Frequency

      • The y-axis (density) indicates the relative frequency of heights in the dataset:

        • Taller bars represent intervals with more individuals, such as the bar around 170 cm.

        • Shorter bars represent intervals with fewer individuals, such as at the edges near 150 cm and 180 cm.

    • Skewness

      • The histogram shows minimal skewness, meaning that the data is nearly symmetrical. There’s no prominent tail on either the left or right, which supports the interpretation of a normal distribution.
    • Outliers

      • The histogram does not show any extreme values or outliers; most of the data is concentrated within a reasonable range of heights.
  • Weight_kg - Histogram. This histogram provides a visual representation of the distribution of weight (Weight_kg) in the dataset. Here’s how to interpret it step by step:

    • Shape of the Distribution

      • The histogram is asymmetrical, with a longer tail to the right. This indicates a positive skew (right skew), meaning there are relatively few individuals with higher weights compared to the majority of the sample.
    • Central Tendency

      • The peak (mode) of the distribution is around 60–65 kg, suggesting that this is the most common weight range in the dataset.
      • The mean is likely greater than the mode due to the positive skew, as the higher weights pull the mean toward the right.

      • The median would fall between the mode and the mean, closer to the bulk of the data.

    • Spread and Range

      • The weights range approximately from 50 kg to 110 kg.
      • Most of the data is concentrated between 60 kg and 80 kg, with relatively fewer individuals below 60 kg or above 100 kg.
    • Density and Frequency

      • The y-axis (density) represents the relative frequency of individuals within weight intervals:

        • The highest bars, around 60–65 kg, indicate that a large proportion of participants fall within this weight range.

        • The shorter bars toward the higher end (above 100 kg) indicate fewer participants in these weight ranges.

    • Skewness

      • The histogram exhibits a moderate positive skew, with a longer tail extending to the right.

      • This suggests that while the majority of individuals fall within a relatively narrow range of weights, there are a few participants with much higher weights.

    • Outliers

      • The right tail suggests the presence of potential outliers among individuals with higher weights (above 100 kg). These outliers could disproportionately affect statistical analyses, particularly those involving means.
  • Height_cm - Boxplot. This is a boxplot for Height_cm, a graphical representation of the distribution of height in the dataset. Here’s how to interpret it:

    • Components of the Boxplot

      • Box:

        • Represents the interquartile range (IQR), which contains the middle 50% of the data.

        • The bottom edge of the box is the 25th percentile (Q1), and the top edge is the 75th percentile (Q3).

        • The line inside the box is the median (Q2), indicating the middle value of the dataset.

      • Whiskers:

        • Extend to the smallest and largest values within 1.5 times the IQR from Q1 and Q3.

        • These represent the spread of the data, excluding potential outliers.

      • Potential Outliers:

        • Points beyond the whiskers (not shown here) would be considered outliers.
    • Key Observations

      • Median (Q2):

        • The median line is located slightly above the center of the box, suggesting that the distribution of height is approximately symmetric: the upper half of the box (165 cm to 169 cm) is only slightly narrower than the lower half (160 cm to 165 cm).
      • Interquartile Range (IQR):

        • The height of the box (distance between Q1 and Q3) reflects the spread of the middle 50% of the data. This range is relatively narrow, indicating moderate variability in participants’ heights.
      • Whiskers:

        • The whiskers extend to the minimum (~150 cm) and maximum (~180 cm) heights within 1.5 × IQR.

        • There are no visible outliers, as no points are shown outside the whiskers.

    • Distribution Characteristics

      • Symmetry:

        • The box and whiskers are roughly balanced, with the median near the center of the box, suggesting that the distribution of height is close to normal.
      • Range:

        • The total range of the data (from the bottom whisker to the top whisker) is approximately 150 cm to 180 cm.
      • Central Tendency:

        • The median indicates the central value of height, likely close to 165 cm (based on earlier statistics).
    • Conclusion

      • This boxplot for Height_cm shows a symmetrical distribution with no outliers and moderate variability. The data appears to be normally distributed, making it well-suited for parametric statistical analyses. The majority of participants’ heights fall between 160 cm and 170 cm, with a total range from 150 cm to 180 cm.
  • Weight_kg - Boxplot. This is a boxplot for Weight_kg, showing the distribution of weight in the dataset. Here’s how to interpret it:

    • Components of the Boxplot

      • Box:

        • Represents the interquartile range (IQR), which contains the middle 50% of the data.

        • The bottom edge of the box corresponds to the 25th percentile (Q1).

        • The top edge of the box corresponds to the 75th percentile (Q3).

        • The line inside the box is the median (Q2), indicating the middle value of the dataset.

      • Whiskers:

        • Extend to the smallest and largest values within 1.5 times the IQR from Q1 and Q3.

        • These represent the range of the data, excluding potential outliers.

      • Outliers:

        • Dots above the top whisker indicate outliers, which are values greater than 1.5 × IQR above Q3.
    • Key Observations

      • Median (Q2):

        • The median is slightly below the center of the box, suggesting that the distribution is slightly right-skewed (positive skew).
      • Interquartile Range (IQR):

        • The height of the box reflects the spread of the middle 50% of the data, with most participants weighing between approximately 65 kg and 80 kg.
      • Whiskers:

        • The whiskers extend downward to approximately 55 kg and upward to approximately 100 kg, indicating the range of most weights without considering outliers.
      • Outliers:

        • Several dots above the top whisker indicate higher weights (above 100 kg) that are considered outliers.

        • These outliers represent participants with significantly higher weights compared to the majority of the dataset.

    • Distribution Characteristics

      • Skewness:

        • The median’s position below the center of the box and the presence of outliers at the upper end suggest a moderate positive skew, with a longer tail extending toward higher weights.
      • Variability:

        • The IQR (distance between Q1 and Q3) is moderate, indicating that the middle 50% of weights are relatively close in range, but the presence of outliers increases the overall variability.
      • Outliers:

        • The outliers (above 100 kg) may disproportionately affect the mean weight, making the median a better measure of central tendency for this variable.
    • Implications for Analysis

      • Normality:

        • The positive skew and presence of outliers indicate that the weight distribution may not follow a normal distribution, which could affect statistical analyses requiring normality.
      • Outliers:

        • The outliers may need to be examined further to determine whether they are valid data points, measurement errors, or extreme but meaningful values.
      • Central Tendency:

        • Given the skewness, the median weight is likely more representative of the central tendency than the mean.
    • Conclusion

      • The boxplot for Weight_kg shows a moderately right-skewed distribution with several outliers above 100 kg. Most participants weigh between 65 kg and 80 kg, with a range extending from approximately 55 kg to 100 kg. The skewness and outliers suggest that parametric analyses requiring normality should be applied cautiously, and non-parametric methods or data transformation may be more appropriate.
Figure 23: Visualizing Data: Histograms and Boxplots
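The boxplot components described above (quartiles, IQR, whiskers, and outliers) can be reproduced by hand. The following is a minimal sketch in Python, not part of the Jamovi workflow. The weights are hypothetical values chosen for illustration, and the quartile method used here ("inclusive" interpolation) is an assumption that may differ slightly from the one Jamovi applies.

```python
from statistics import quantiles

# Hypothetical weights in kg (illustrative only, not taken from the dataset)
weights_kg = [55, 60, 62, 65, 68, 70, 72, 74, 78, 80, 85, 90, 104, 120]

# Q1, median (Q2), and Q3; the interpolation method is an assumption
q1, q2, q3 = quantiles(weights_kg, n=4, method="inclusive")
iqr = q3 - q1

# Fences: values beyond 1.5 x IQR from the box are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations that are still inside the fences
inside = [w for w in weights_kg if lower_fence <= w <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)
outliers = [w for w in weights_kg if w < lower_fence or w > upper_fence]

print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}, outliers: {outliers}")
```

Note that the whiskers stop at actual data values, so they are usually shorter than the fences themselves.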

Step 8. Univariate Inferential Statistics

Univariate hypothesis tests are statistical procedures that examine a single variable. They help determine whether that variable differs significantly from a hypothesized value or distribution.

We will explore various univariate hypothesis tests using the mlbootcamp5_train.xlsx dataset in Jamovi. Each test includes a clear hypothesis, step-by-step instructions for execution in Jamovi, and guidance on interpreting the results.

8.1 One-Sample t-Test: Testing Mean Age

Objective

Determine if the average age of participants differs significantly from 30 years.

Hypotheses

  • Null Hypothesis (\(H_0\)): The mean age is equal to 30 years.
    \(H_0: \mu = 30\)

  • Alternative Hypothesis (\(H_1\)): The mean age is not equal to 30 years.
    \(H_1: \mu \neq 30\)

Steps in Jamovi

  1. Perform One-Sample t-Test:

    • Navigate to Analyses → T-Tests → One-Sample T-Test.

    • Variables: Move Age_years to the Variables box.

    • Test Value: Enter 30.

    • Options: Check Descriptives and Descriptive plots if desired.

Interpreting Results

  • Descriptives: Review the mean, standard deviation, and sample size.

  • t-Test Results:

    • t-value: Indicates the ratio of the difference between sample mean and test value to the standard error.

    • Degrees of Freedom (df): Typically \(n-1\).

    • p-value: If \(p<0.05\), reject the null hypothesis.

Example Interpretation: Since \(p<0.001\), conclude that the average age significantly differs from 30 years.

Figure 24: One-Sample t-Test: Testing Mean Age
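To make the t-value and degrees of freedom described above concrete, here is a minimal sketch in Python that computes them by hand. The ages are hypothetical values for illustration, not the dataset's Age_years column. The p-value is left to Jamovi, since it requires the t distribution.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical ages in years (illustrative only, not taken from the dataset)
ages_years = [52, 48, 61, 55, 50, 58, 44, 63]
test_value = 30  # hypothesized mean under H0

n = len(ages_years)
sample_mean = mean(ages_years)
# Standard error of the mean; stdev() uses the n - 1 denominator
standard_error = stdev(ages_years) / sqrt(n)

# t = (sample mean - test value) / standard error, with n - 1 degrees of freedom
t_value = (sample_mean - test_value) / standard_error
df = n - 1

print(f"mean = {sample_mean:.3f}, t({df}) = {t_value:.2f}")
```

A large absolute t-value means the sample mean lies many standard errors away from the test value, which is what leads to a small p-value.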

8.2 Binomial test (Proportion Test): Gender Distribution

Objective

Assess whether the gender distribution of participants deviates from an expected 50% Female and 50% Male distribution.

The binomial test determines whether the proportions of the two categories of a binary variable (e.g., gender) deviate from an expected proportion.

Hypotheses

  • Null Hypothesis (\(H_0\)) : Gender distribution is 50% Female and 50% Male.
    \(H_0: P(Female)=0.5, P(Male)=0.5\)

  • Alternative Hypothesis (\(H_1\)): Gender distribution is not 50% Female and 50% Male.
    \(H_1: P(Female) \neq 0.5, P(Male) \neq 0.5\)

Steps in Jamovi

  1. Navigate to the Binomial Test:

    • Go to Analyses → Frequency → 2 Outcomes - Binomial test
  2. Set Up the Test:

    • Variables: Drag Gender to the Variables box.

    • Test value: Enter 0.5.

Interpreting Results

  • Observed Proportions: Displays the proportion of females and males in the sample (e.g., Female = 65%, Male = 35%).

  • p-value: If \(p < 0.05\), reject the null hypothesis.

Example Interpretation

  • Hypothesis

    • Null Hypothesis (\(H_0\)): The proportion of females and males matches the expected proportion of 50% Female and 50% Male.
    • Alternative Hypothesis (\(H_A\)): The proportion of females and males does not match the expected 50%-50% distribution.
  • Key Results

    • Counts and Proportions

      • Female:

        • Count: 41,334 participants.

        • Total: 63,259 participants.

        • Proportion: \(\frac{41334}{63259} = 0.65\) (65% of the total sample are female).

      • Male:

        • Count: 21,925 participants.

        • Total: 63,259 participants.

        • Proportion: \(\frac{21925}{63259} = 0.35\) (35% of the total sample are male).

      These proportions indicate that there are significantly more females than males in the dataset.

    • p-Values

      • Both proportions for Female and Male have a p-value of < 0.0001.

        • This means that the observed proportions significantly deviate from the expected 50%-50% distribution.

        • In other words, the null hypothesis (\(H_0\)) is rejected.

    • Statistical Interpretation

      • The proportion of females (65%) is significantly higher than the expected 50%.

      • Similarly, the proportion of males (35%) is significantly lower than the expected 50%.

      • Since both p-values are < 0.0001, this deviation is highly statistically significant.

    • Practical Implications

      • Gender Imbalance: The dataset is skewed toward female participants (65%), indicating a notable gender imbalance.

      • Bias or Representation: This imbalance could reflect:

        • Sampling bias (e.g., the dataset was collected in a way that over-represented females).

        • Real-world trends (e.g., more females were willing or available to participate in the study).

      • Generalizability: The findings from this dataset might be more reflective of the female population and less representative of the male population.

    • Reporting Results

      • A binomial test was conducted to assess whether the gender distribution of participants deviated from an expected 50% Female and 50% Male distribution. The results showed a significant deviation (\(p<0.0001\)), with 65% of participants being female (n = 41,334) and 35% being male (n = 21,925). These findings indicate a significant gender imbalance in the dataset.
Figure 25: Binomial test (Proportion Test): Gender Distribution
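The size of the deviation reported above can be checked with a quick calculation. The sketch below uses the counts from the example (41,334 females out of 63,259) but applies a normal approximation (a z-test for one proportion) rather than the exact binomial test that Jamovi reports. With a sample this large the two approaches should lead to the same conclusion, but the approximation is an assumption of this sketch.

```python
from math import sqrt
from statistics import NormalDist

female_count = 41334  # count reported in the example above
total = 63259         # total sample size reported above
expected_p = 0.5      # proportion under H0

observed_p = female_count / total

# Standard error of a proportion under H0
standard_error = sqrt(expected_p * (1 - expected_p) / total)

# z statistic and two-sided p-value from the standard normal distribution
z_value = (observed_p - expected_p) / standard_error
p_value = 2 * (1 - NormalDist().cdf(abs(z_value)))

print(f"observed proportion = {observed_p:.3f}, z = {z_value:.1f}")
print("p < 0.0001" if p_value < 0.0001 else f"p = {p_value:.4f}")
```

The observed proportion lies dozens of standard errors above 0.5, which is why the p-value is far below any conventional threshold.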

8.3 Chi-Square Goodness-of-Fit Test: Cholesterol Level Distribution

Objective

Evaluate whether the distribution of cholesterol levels matches the expected distribution:

  • Normal: 50%

  • Above Normal: 30%

  • Well Above Normal: 20%

Hypotheses

  • Null Hypothesis (\(H_0\)): The cholesterol level distribution matches the expected proportions.
    \(H_0: P(\text{Normal}) = 0.50, P(\text{Above Normal}) = 0.30, P(\text{Well Above Normal}) = 0.20\)

  • Alternative Hypothesis (\(H_1\)): The distribution does not match the expected proportions.
    \(H_1\): At least one proportion differs.

Steps in Jamovi

  1. Navigate to Chi-Square Test:

    • Go to Analyses → Frequency → N Outcomes - \(\chi^2\) Goodness of fit
  2. Set Up the Test:

    • Variables: Drag CholesterolLevel to the Variables box.

    • Expected Counts: Check the Expected counts checkbox.

    • Expected Proportions:

      • Enter 5 for Normal, 3 for Above Normal, and 2 for Well Above Normal.

      • The test uses expected proportions based on the given ratios:

        • Normal: Ratio = 5 → Proportion = 0.50 (50% expected)

        • Above Normal: Ratio = 3 → Proportion = 0.30 (30% expected)

        • Well Above Normal: Ratio = 2 → Proportion = 0.20 (20% expected)

Interpreting Results

  • Observed vs. Expected Counts: Compare actual counts to expected counts based on proportions.

  • Chi-Square Statistic: Higher values indicate greater discrepancy.

  • p-value: If \(p < 0.05\), reject the null hypothesis.

Example Interpretation

  • Chi-Square Test Results

    • Chi-Square Statistic (\(\chi^2\))

      • The \(\chi^2\) statistic is 16,474.78. This value measures how much the observed counts deviate from the expected counts.
    • Degrees of Freedom (\(df\))

      • Degrees of freedom (\(df\)): \(k-1=3-1=2\), where \(k\) is the number of categories.
    • p-Value

      • p < 0.0001: The p-value is highly significant, meaning that the observed proportions deviate significantly from the expected proportions.
  • Interpretation

    • The observed proportions (75% Normal, 13% Above Normal, 11% Well Above Normal) significantly differ from the expected proportions (50%, 30%, 20%).

    • The null hypothesis is rejected (\(p<0.0001\)), indicating that the distribution of cholesterol levels in the dataset does not match the expected distribution.

  • Practical Implications

    • Normal Cholesterol Levels: The observed proportion (75%) is significantly higher than the expected proportion (50%), suggesting that a larger percentage of participants have normal cholesterol than anticipated.

    • Above Normal and Well Above Normal Levels: Both categories are underrepresented compared to the expected proportions (13% vs. 30% and 11% vs. 20%, respectively). This could indicate that fewer participants have elevated cholesterol levels than expected based on the predefined ratios.

  • Reporting Results

    • A chi-square goodness-of-fit test was conducted to determine whether the observed distribution of cholesterol levels deviated from the expected proportions (50% Normal, 30% Above Normal, 20% Well Above Normal). The results were significant (\(\chi^2(2) = 16{,}474.78, p<0.0001\)), indicating that the observed proportions (75% Normal, 13% Above Normal, 11% Well Above Normal) significantly differed from the expected proportions. This suggests that the sample has a higher proportion of individuals with normal cholesterol and fewer with elevated cholesterol than anticipated.
Figure 26: Chi-Square Goodness-of-Fit Test: Cholesterol Level Distribution
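The goodness-of-fit calculation can be sketched in a few lines of Python: convert the 5:3:2 ratios into expected counts, then sum the squared deviations divided by the expected counts. The observed counts below are hypothetical (a sample of 1,000 chosen for easy arithmetic), not the dataset's actual counts. The p-value shortcut used here is valid only for \(df = 2\), where the chi-square upper-tail probability equals \(e^{-\chi^2/2}\).

```python
from math import exp

# Hypothetical observed counts (illustrative only, not taken from the dataset)
observed = {"Normal": 750, "Above Normal": 130, "Well Above Normal": 120}
ratios = {"Normal": 5, "Above Normal": 3, "Well Above Normal": 2}

n = sum(observed.values())
ratio_total = sum(ratios.values())

# Expected counts: each ratio as a share of the total, times the sample size
expected = {level: n * ratios[level] / ratio_total for level in ratios}

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi_square = sum(
    (observed[level] - expected[level]) ** 2 / expected[level]
    for level in observed
)
df = len(observed) - 1

# For df = 2 only, the upper-tail probability has the closed form exp(-x / 2)
p_value = exp(-chi_square / 2)

print(f"expected counts: {expected}")
print(f"chi-square({df}) = {chi_square:.2f}")
print("p < 0.0001" if p_value < 0.0001 else f"p = {p_value:.4f}")
```

Entering 5, 3, and 2 in Jamovi is equivalent to entering 0.5, 0.3, and 0.2, because only the relative sizes of the ratios matter.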

8.4 Normality of Height

Objective

Assess whether the distribution of participants’ heights is normally distributed.

Normality tests assess whether a variable (in this case, Height_cm) follows a normal distribution. This is important for determining whether parametric tests, which assume normality, can be applied.

This section presents the results of normality tests conducted on the variable Height_cm. Here’s a detailed interpretation of each test and the overall conclusions:

Hypotheses

  • Null Hypothesis (\(H_0\)): Heights are normally distributed.
    \(H_0:\text{Height} \sim \mathcal{N}(\mu, \sigma)\)

  • Alternative Hypothesis (\(H_1\)): Heights are not normally distributed.
    \(H_1:\text{Height} \not\sim \mathcal{N}(\mu, \sigma)\)

Steps in Jamovi

  1. Navigate to Descriptive Statistics:

    • Go to Analyses → Exploratory → Descriptives.

    • Variables: Drag Height_cm to the Variables box.

    • Under Statistics, check Normality - Shapiro-Wilk.

    • Under Plots, check Q-Q plots.

  2. If the moretests module is not installed, install it.

  3. Navigate to T-Tests:

    • Go to Analyses → T-Tests → One Sample T-Test.

    • Variables: Drag Height_cm to the Variables box.

    • Under Assumption Checks, check Normality test and Q-Q Plot.
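The Q-Q plot requested in the steps above pairs each sorted sample value with the value a normal distribution would predict at the same rank. The sketch below shows one common way to compute those pairs in Python. The heights are hypothetical, and the plotting positions \((i - 0.5)/n\) are an assumption; Jamovi may use a different convention and plots standardized values.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical heights in cm (illustrative only, not taken from the dataset)
heights_cm = [152, 158, 160, 163, 165, 165, 168, 170, 174, 181]

n = len(heights_cm)
sample_quantiles = sorted(heights_cm)

# Normal distribution with the same mean and standard deviation as the sample
fitted = NormalDist(mean(heights_cm), stdev(heights_cm))

# Theoretical quantiles at plotting positions (i - 0.5) / n, an assumed convention
theoretical_quantiles = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# Points near the line y = x indicate agreement with the normal distribution
for theoretical, sample in zip(theoretical_quantiles, sample_quantiles):
    print(f"theoretical = {theoretical:7.2f}   sample = {sample:7.2f}")
```

If the sample values at the low and high ends drift away from their theoretical counterparts, the plot bends at the tails.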

Interpreting Results

  • Shapiro-Wilk Test

    • Statistic: Not available (NaN).

    • Reason: The Shapiro-Wilk test was not calculated because the sample size exceeds 5,000 observations. Shapiro-Wilk is typically used for smaller datasets and is not computed for very large samples due to computational limitations and sensitivity to large sample sizes.

  • Kolmogorov-Smirnov Test

    • Statistic: (\(0.07\))

    • p-value: (\(< 0.0001\))

      • This indicates a significant result, meaning the distribution of Height_cm deviates from normality.
  • Anderson-Darling Test

    • Statistic: (\(164.11\))

    • p-value: (\(< 0.0001\))

      • This also shows a significant result, confirming that Height_cm does not follow a normal distribution.
  • QQ Plot

    • The QQ (Quantile-Quantile) plot is a graphical method to assess normality by comparing the distribution of the data to a theoretical normal distribution.

    • Observations from the QQ Plot:

      • Straight Line: The points closely follow a straight line, particularly in the central portion of the distribution, indicating that the middle range of the data aligns well with a normal distribution.

      • Deviations at the Tails: Some deviation is visible at the upper and lower tails (ends of the plot). This suggests potential minor deviations from normality, likely due to extreme values or outliers.

    • Normality:

      • While the Shapiro-Wilk test is not computed due to the large sample size, the QQ plot provides sufficient evidence that the data is approximately normal, with some minor deviations at the tails.

      • In large datasets, even small deviations from normality can appear statistically significant but may not be practically meaningful.

  • Interpretation of Results

    • Non-Normality: The tests suggest that Height_cm is not perfectly normally distributed. However, the deviation might not be practically significant depending on the shape of the distribution.
    • Parametric Tests: Despite the non-normality, parametric tests (e.g., t-tests, ANOVA) are robust to minor deviations from normality, particularly for large sample sizes, due to the Central Limit Theorem.
    • Shapiro-Wilk Statistic (W): Measures the degree of normality; values closer to 1 indicate a closer fit to a normal distribution.
  • Considerations for Large Sample Sizes

    • In datasets with large sample sizes:

      • Normality tests tend to detect even minor deviations from normality as statistically significant.

      • Visual inspection methods (e.g., histograms, QQ plots) should complement normality tests to assess practical (rather than statistical) normality.

  • Example Reporting

    • Normality tests were conducted to assess whether Height_cm follows a normal distribution. The Kolmogorov-Smirnov Test (\(D = 0.07\)), (\(p < 0.0001\)) and Anderson-Darling Test (\(A = 164.11\)), (\(p < 0.0001\)) both indicated significant deviations from normality. However, given the large sample size, visual inspection of the data is recommended to assess the practical implications of these findings. The QQ plot of the data reveals that the distribution closely follows a normal distribution, with minor deviations at th